
ENH: Two-level content-addressed CastXML cache + pkl stamp fix #6486

Open
hjmjohnson wants to merge 12 commits into InsightSoftwareConsortium:main from hjmjohnson:ci/linux-azure-disk-management


@hjmjohnson hjmjohnson commented Jun 21, 2026

Member

Add a two-level content-addressed CastXML cache (ITK_WRAP_CASTXML_CACHE, default ON) that
avoids re-running CastXML when the input headers and compiler flags are unchanged, and
remove the vestigial igenerator-level cache, which masked missing pkl files.

Cache design
  • L1 (no subprocess): sha256 of .cxx content + .castxml.inc flags → L2 key
  • L2 (content-only): sha256 of castxml -E preprocessor output → output.xml.gz

L2 keys are path-independent, so multiple worktrees and CI agents share one store.
Cache root: ~/.cache/itk-wrap (override via ITK_WRAP_CACHE).
ITK_WRAP_CACHE_VERBOSE=1 logs HIT/MISS per file.
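
The two key levels above can be sketched roughly as follows. This is a minimal illustration, not the code in itk-castxml-cache.py: the function names `l1_key`, `l2_key`, and `lookup`, the length-prefixed hashing, and the in-memory dictionaries standing in for the on-disk store are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def l1_key(cxx_path: Path, inc_path: Path) -> str:
    """Level-1 key: hash the .cxx content and the .castxml.inc flags (no subprocess)."""
    h = hashlib.sha256()
    for p in (cxx_path, inc_path):
        data = p.read_bytes()
        # Length prefix (an assumption) keeps the two inputs from running together.
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()


def l2_key(preprocessed: bytes) -> str:
    """Level-2 key: hash of the preprocessor output only, so no paths are involved."""
    return hashlib.sha256(preprocessed).hexdigest()


def lookup(l1_map: dict, store: dict, cxx: Path, inc: Path):
    """Return cached XML bytes on an L1 hit, or None when the caller must preprocess."""
    l2 = l1_map.get(l1_key(cxx, inc))
    return store.get(l2) if l2 is not None else None
```

On an L1 miss the caller would run the preprocessor, compute `l2_key` on its output, and only invoke a full CastXML run if that key is also absent from the store.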

igenerator cache removal

The igenerator-level LRU cache restored .i/.idx/.index.txt but silently skipped
.pkl files (names not known at cache-write time). This caused pyi_generator.py to
fail with "No pickle files were found" on incremental rebuilds. Removing the cache
eliminates the failure mode; the CastXML cache recovers the wall-clock savings.

A per-module <Module>.pkl.stamp file (new --pkl_stamp argument to igenerator.py)
is declared as a CMake OUTPUT so ninja can track pkl-file completeness without
enumerating the 1028 individual pkl paths at configure time.

Benchmark results (72-core Linux, local cache)
| Build | Condition | Wall-clock | CastXML cache | ccache rate |
|-------|-----------|------------|---------------|-------------|
| 6 | warmup (fills ccache) | 6m47s | cold | 99% |
| 7 | cold CastXML cache, warm ccache | 6m39s | cold (seeds) | 100% |
| 8 | warm cache, same build dir | 6m7s | 816/816 hits | 100% |
| 9 | warm cache, fresh build dir | 6m29s | 816/816 hits | 99% |
| 10 | ninja-warm (incremental) | 1s | n/a | 0 compilations |

Relative to build 7 (6m39s): same-dir savings (build 8) 32 s (8%); cross-dir savings (build 9) 10 s (2.5%).
CastXML parallelizes well on 72 cores, limiting the cache speedup; igenerator
and SWIG generation dominate the remaining wall time.
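
The savings figures can be rechecked from the wall-clock column. The sketch below assumes build 7 (cold CastXML cache, warm ccache) is the baseline; `to_seconds` is a helper written for this example only.

```python
import re


def to_seconds(stamp: str) -> int:
    """Convert a wall-clock string such as '6m39s' or '1s' to seconds."""
    m = re.fullmatch(r"(?:(\d+)m)?(\d+)s", stamp)
    if m is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    return int(m.group(1) or 0) * 60 + int(m.group(2))


baseline = to_seconds("6m39s")  # build 7, assumed baseline
for label, stamp in (("same-dir", "6m7s"), ("cross-dir", "6m29s")):
    saved = baseline - to_seconds(stamp)
    print(f"{label}: {saved} s ({100 * saved / baseline:.1f}%)")
```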

Documentation

Documentation/docs/contributing/wrapping_architecture.md added:
full pipeline reference from .wrap files through configure-time generation,
CastXML, igenerator, SWIG, compilation, linking, and pyi generation.

@hjmjohnson hjmjohnson marked this pull request as ready for review June 21, 2026 23:49
@github-actions github-actions Bot added the type:Infrastructure, area:Python wrapping, and type:Testing labels Jun 21, 2026
@greptile-apps

This comment was marked as resolved.

@github-actions github-actions Bot added the area:IO Issues affecting the IO module label Jun 22, 2026
@hjmjohnson hjmjohnson force-pushed the ci/linux-azure-disk-management branch from 3400645 to 20b1c6e Compare June 22, 2026 03:08
@github-actions github-actions Bot removed the area:IO Issues affecting the IO module label Jun 22, 2026
@hjmjohnson

Member Author

/azp run ITK.macOS.Python

@github-actions github-actions Bot added the area:IO Issues affecting the IO module label Jun 22, 2026
@hjmjohnson hjmjohnson force-pushed the ci/linux-azure-disk-management branch from fba93ba to 47a29b7 Compare June 22, 2026 18:49
@github-actions github-actions Bot removed the area:IO Issues affecting the IO module label Jun 22, 2026
hjmjohnson and others added 12 commits June 23, 2026 05:34
Ubuntu-22.04 and ubuntu-24.04 hosted agents ship Android SDK (~9 GB),
Haskell/GHCup (~5 GB), .NET (~2-3 GB), Swift (~1.5 GB), CodeQL (~2 GB),
and Boost headers (~1.2 GB). ITK's Linux builds use none of these;
removing them at job start recovers ~20 GB before checkout, ccache
restore, and the build itself consume disk.
Add ITK_WRAP_CASTXML_CACHE option (default OFF).  Wraps castxml
with a two-level cache:

  L1 (no subprocess): sha256 of binary content-hash + inc + cxx
  L2 (content-only):  sha256 of castxml -E output, markers stripped
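
"Markers stripped" can be pictured as dropping preprocessor line markers, which embed absolute paths, before hashing. The sketch below assumes the GCC/Clang marker shapes (`# 12 "file" 1` and `#line 12 "file"`); the regular expression and the names `strip_markers` and `content_key` are illustrative, and the actual script may strip a different set of lines.

```python
import hashlib
import re

# Assumed marker shapes: '# <num> "<file>" <flags>' and '#line <num> "<file>"'.
_LINE_MARKER = re.compile(rb'^#(?:line)?\s+\d+(?:\s+"[^"]*".*)?$')


def strip_markers(preprocessed: bytes) -> bytes:
    """Remove line markers so the remaining text carries no build paths."""
    kept = [ln for ln in preprocessed.splitlines() if not _LINE_MARKER.match(ln)]
    return b"\n".join(kept)


def content_key(preprocessed: bytes) -> str:
    """sha256 over the marker-free preprocessor output."""
    return hashlib.sha256(strip_markers(preprocessed)).hexdigest()
```

Two worktrees that preprocess identical sources then produce the same key even though their marker lines name different directories.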

L1 hit restores gzip-compressed XML with no castxml process.
L2 keys are path-independent; worktrees share the same store.
Binary fingerprinted by content hash so ninja -t clean reuses L1.
LRU eviction via background fork; 2 GiB cap (ITK_WRAP_CACHE_MAX_SIZE).
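
A size-capped LRU pass of the kind described here might look like the sketch below. It assumes file modification time is the recency signal (that is, that a cache hit touches the entry) and leaves out the background fork; `evict_lru` and `DEFAULT_CAP` are names invented for the example.

```python
from pathlib import Path

DEFAULT_CAP = 2 * 1024**3  # 2 GiB, the default cap named in the commit message


def evict_lru(root: Path, max_bytes: int = DEFAULT_CAP) -> list:
    """Delete least-recently-modified files until the tree fits under max_bytes."""
    entries = []
    for p in root.rglob("*"):
        if p.is_file():
            st = p.stat()
            entries.append((st.st_mtime, st.st_size, p))
    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, p in sorted(entries, key=lambda e: e[0]):  # oldest first
        if total <= max_bytes:
            break
        try:
            p.unlink()
        except FileNotFoundError:
            pass  # a concurrent evictor may already have removed it
        total -= size
        removed.append(p)
    return removed
```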

igenerator.py gains matching LRU eviction and bypass flag.
…cache

Extend ITK_WRAP_CACHE to a colon-separated list of roots (like PATH).
Reads search each root in order; writes go to the first that accepts an
atomic rename.  A read-only shared NFS cache can follow a writable SSD:
  export ITK_WRAP_CACHE=/local/ssd/cache:/nfs/lab/shared-cache
Students get L2 hits from the shared cache while storing L1 maps locally.
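
The read-in-order, write-to-first-writable behavior could be sketched as follows. Only the ITK_WRAP_CACHE variable, the colon separator, and the ~/.cache/itk-wrap default come from this PR; the helper names `cache_roots`, `read_entry`, and `write_entry` and the flat relative-path layout are assumptions.

```python
import os
import tempfile
from pathlib import Path


def cache_roots(env=None) -> list:
    """Split ITK_WRAP_CACHE on ':' (like PATH); fall back to the default root."""
    env = os.environ if env is None else env
    roots = [Path(p) for p in env.get("ITK_WRAP_CACHE", "").split(":") if p]
    return roots or [Path.home() / ".cache" / "itk-wrap"]


def read_entry(roots, rel: str):
    """Return the entry from the first root that has it, else None."""
    for root in roots:
        try:
            return (root / rel).read_bytes()
        except OSError:
            continue
    return None


def write_entry(roots, rel: str, data: bytes):
    """Write to the first root that accepts an atomic rename; return that root."""
    for root in roots:
        target = root / rel
        tmp_name = None
        try:
            target.parent.mkdir(parents=True, exist_ok=True)
            fd, tmp_name = tempfile.mkstemp(dir=target.parent)
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp_name, target)  # atomic within one filesystem
            return root
        except OSError:
            if tmp_name and os.path.exists(tmp_name):
                os.unlink(tmp_name)
    return None
```

A read-only root simply fails the write attempt and the loop moves on, which is what lets a shared NFS store sit behind a writable local one.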

Add ITK_WRAP_CACHE_FORMAT=uncompressed: stores plain XML and restores
via os.link() when cache and build share a filesystem, so A/B/C/D test
builds each cost one L2 inode rather than N copies.  Falls back to
shutil.copy2() on cross-device links.  gzip remains the default.

Unlink output_xml before a full castxml run to sever any prior hardlink
to the L2 store so castxml cannot corrupt a shared inode.
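
A hedged sketch of the restore and unlink behavior this commit message describes: try os.link(), fall back to shutil.copy2(), and remove the output before a full run so a write never lands on a shared inode. The function names are hypothetical and not taken from the script.

```python
import os
import shutil
from pathlib import Path


def restore_xml(cached: Path, output_xml: Path) -> str:
    """Restore a cached XML file; report whether a hardlink or a copy was used."""
    output_xml.unlink(missing_ok=True)  # os.link() fails if the target exists
    try:
        os.link(cached, output_xml)
        return "hardlink"
    except OSError:
        # Cross-device link, or a filesystem without hardlink support.
        shutil.copy2(cached, output_xml)
        return "copy"


def prepare_for_castxml(output_xml: Path) -> None:
    """Sever any hardlink to the store before castxml rewrites the output."""
    output_xml.unlink(missing_ok=True)
```

Without the unlink step, a tool that opens the output for writing in place would modify the cached entry too, since both names refer to the same inode.
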
The constant is a key-algorithm version salt, not a storage format
descriptor.  Renaming clarifies that it belongs to the hash key
computation and should not change when the storage format changes.
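
The role of a key-algorithm version salt can be shown in a few lines. The constant name `KEY_VERSION`, its value, and the NUL separator are placeholders for illustration; the point is only that the salt feeds the hash, so changing it moves every entry to a new key while the storage format is untouched.

```python
import hashlib

KEY_VERSION = b"v4"  # hypothetical value; bump only when the key algorithm changes


def salted_key(content: bytes, version: bytes = KEY_VERSION) -> str:
    """Mix the algorithm version into the hash so old keys no longer match."""
    h = hashlib.sha256()
    h.update(version + b"\0")  # separator keeps the salt distinct from the content
    h.update(content)
    return h.hexdigest()
```
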
…E to ON

Remove the hardlink restore path from _restore_xml() — shutil.copy2()
is sufficient; disk space is not constrained enough to justify the
POSIX-only os.link() complexity and cross-device fallback.  gzip
remains the default storage format (~253 MB for a full 807-module
build vs 2.2 G uncompressed).

Default ITK_WRAP_CASTXML_CACHE to ON so new build directories benefit
from cross-dir L2 sharing without manual configuration.  The cache
location defaults to ~/.cache/itk-wrap; CI overrides via ITK_WRAP_CACHE.
Add .github/workflows/python.yml (ITK.Pixi.Python) to run the Python
wrapping build on ubuntu-24.04, windows-2022, and macos-15.  Mirror
the ccache persistence pattern from Pixi-Cxx: restore before configure,
save (if !cancelled) after build.

Add a second castxml-v1 cache restore/save pair pointing at
${{ runner.temp }}/itk-castxml-cache, passed to the build via
ITK_WRAP_CACHE.  On a cold run the cache is seeded; on a warm run
castxml is skipped for all 807 wrapped types — measured 6m37s vs 9m30s
on a 72-core machine, larger speedup expected on 4-core CI runners
where castxml is on the critical path.

Add configure-python-ci, build-python-ci, and test-python-ci pixi
tasks that mirror their non-CI counterparts but pass
-DITK_WRAP_CASTXML_CACHE:BOOL=ON explicitly.
Add ITK_WRAP_CACHE pipeline variable and a Cache@2 restore task
(castxml-v1 key) to ITK.Linux.Python, ITK.macOS.Python, and
ITK.Windows.Python.  The Cache@2 task mirrors the existing ccache
pattern: restore before the build step, Azure DevOps automatically
saves on post-job when the path is non-empty.

ITK_WRAP_CASTXML_CACHE defaults to ON (set in itkWrapCastXMLCacheSupport.cmake),
so the cache is active without any dashboard.cmake change.
Wrapping/CMakeLists.txt: include(itkWrapCastXMLCacheSupport) so
ITK_WRAP_CASTXML_CACHE_SCRIPT is set for the condition guard in
itk_auto_load_submodules.cmake; guarded by ITK_WRAP_PYTHON.

python.yml: exclude windows-2022; itk_end_wrap_module.cmake
produces an igenerator command exceeding cmd.exe's 8191-char
batch-file line limit for large modules such as
ITKImageIntensity (59 submodules). Pre-existing issue, unrelated
to the castxml cache changes.

Assisted-by: Claude Code — root-cause: missing include and Windows batch-file limit
Invalidates all existing v3 L2 entries (different hash prefix → different
path → orphaned, pruned by LRU eviction) so the next build seeds fresh
timing data for the 5-build overnight benchmark protocol.

Co-Authored-By: Hans Johnson <hans.j.johnson@gmail.com>
The igenerator cache (ITK_IGENERATOR_CACHE / ~/.cache/itk-igenerator) was
an incomplete implementation that saved .i/.idx/SwigInterface.h files but
never saved .pkl files. On a warm cache hit _igenerator_restore() returned
early, leaving the itk-pkl/ directory empty. pyi_generator then failed with
"No pickle files were found".

The itk-castxml-cache already covers the expensive CastXML step. igenerator
itself is fast once the XML is available, so a separate layer adds complexity
without benefit. Remove the six cache functions and their two call sites in
main() entirely, restoring the original clean architecture.
igenerator.py writes N pkl files per module as side effects that ninja
cannot track because their names (ClassName.SubmoduleName.pkl) are not
enumerable at CMake configure time.  When pkl files are deleted while
the .index.txt byproducts survive, ninja considers igenerator up-to-date
and pyi_generator.py fails with "No pickle files were found."

Add a --pkl_stamp argument to igenerator.py.  The stamp is written after
all pkl files for the module are complete and is declared as a CMake
OUTPUT of the igenerator add_custom_command.  Ninja now re-runs
igenerator whenever the stamp is absent, which guarantees the pkl files
are regenerated before pyi_generator.py reads the .index.txt manifests.
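
The ordering guarantee behind the stamp (all pkl files first, stamp last) can be sketched as below. What igenerator.py actually writes into the stamp is not stated here, so the file listing used as stamp content, like the function name `write_pkls_then_stamp`, is an assumption.

```python
import os
import pickle
import tempfile
from pathlib import Path


def write_pkls_then_stamp(pkl_dir: Path, stamp: Path, objects: dict) -> None:
    """Write every pkl file, then publish the stamp as the final step."""
    pkl_dir.mkdir(parents=True, exist_ok=True)
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.unlink(missing_ok=True)  # a stale stamp must not survive a failed run
    names = []
    for name, obj in objects.items():
        with open(pkl_dir / f"{name}.pkl", "wb") as f:
            pickle.dump(obj, f)
        names.append(f"{name}.pkl")
    # Write to a temporary file and rename, so the stamp appears all at once.
    fd, tmp = tempfile.mkstemp(dir=stamp.parent)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(sorted(names)) + "\n")
    os.replace(tmp, stamp)
```

Because the stamp is the declared output, deleting it (or never reaching the final rename) is enough to make the build tool schedule the generator again.
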
Documents the two-phase pipeline (CMake configure → Ninja build) that
converts .wrap files into .abi3.so modules and .pyi stubs.  Covers:

- Configure phase: how .wrap macros produce the three files written to
  castxml_inputs/ (.cxx, .castxml.inc, SwigInterface.h.in)
- Build phase: CastXML (816 independent jobs) → igenerator.py (96
  per-module jobs, no global barrier) → SWIG/compile/link → pyi_generator
- Key file reference table mapping each file to its writer and reader
- CastXML cache and ccache summary
- Ninja dependency graph in ASCII
- Troubleshooting section for the two most common failure modes
@hjmjohnson hjmjohnson force-pushed the ci/linux-azure-disk-management branch from 47a29b7 to cbcd65a Compare June 23, 2026 10:39
@github-actions github-actions Bot added the area:Documentation Issues affecting the Documentation module label Jun 23, 2026
@hjmjohnson hjmjohnson changed the title from "WIP: CI TESTING ENH: Two-level CastXML/igenerator build cache + Python CI workflow" to "ENH: Two-level content-addressed CastXML cache + pkl stamp fix" Jun 23, 2026

Labels

area:Documentation (Issues affecting the Documentation module)
area:Python wrapping (Python bindings for a class)
type:Infrastructure (Infrastructure/ecosystem related changes, such as CMake or buildbots)
type:Testing (Ensure that the purpose of a class is met/the results on a wide set of test cases are correct)
